Red Wine Analysis by Elzani Pretorius

For this analysis I will be investigating a red wine data set. This data set contains 1599 samples of red wine, and for each sample it has information on the following chemical properties: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.

Furthermore, the quality rating for each sample, as determined by wine experts, is also provided. For my investigation of this data set I aim to understand which properties have the greatest influence on the quality rating, and to look at the relationship each property has with quality. The quality ratings can range from 0 (very bad) to 10 (excellent).

Below you can see the first six entries for each variable in the dataset.

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

Univariate Plots Section

In this section I will start to explore the dataset by looking at variables individually using visualizations and numerical summaries. By doing this I hope to start building better intuition about the data, identify outliers, and visualise the spread for each variable.

red$X <- NULL
## [1] 1599   12
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Above I found the dimensions of the data: 1599 observations and 12 variables. I removed the column named “X” since it only contains the index of each observation and therefore will not be useful in this analysis. The summary of each variable can also be seen above.
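The steps described above can be sketched in base R. Since the data file itself is not part of this report, the sketch rebuilds only the six rows printed earlier rather than reading the full data. The name `red` matches the one that appears in the report's output; building the data frame by hand is purely illustrative.

```r
# Rebuild the six rows printed at the top of the report (illustration only;
# the full analysis uses all 1599 samples).
red <- data.frame(
  X = 1:6,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8),
  chlorides = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075),
  free.sulfur.dioxide = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  pH = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)

red$X <- NULL        # drop the row-index column
dims <- dim(red)     # number of rows, number of columns
print(dims)
print(summary(red$quality))
```

On the full dataset the same two calls, `dim()` and `summary()`, would give the dimensions and numerical summaries printed above.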

Next, I will look at the distribution of the quality variable in this dataset:

## [1] 280  12

It seems that the majority of the wine samples received a quality rating of either 5 or 6 - an average rating. Based on the summary of the data seen previously and the histogram above, the highest rating is 8 and the lowest is 3. There are 280 samples where the rating is either lower than 5 or higher than 6. This is a small proportion of the 1599 data samples, roughly 17.5%. Even assuming that this data gives a good representation of how wine quality is distributed from poor to excellent, it would be better to use a larger dataset. This would give us more non-average (not 5 or 6) samples to work with. Increasing the dataset size would therefore help increase the reliability of the conclusions drawn about which chemical properties influence wine quality and to what extent they do so.
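The count of non-average samples and its share of the dataset can be checked with a few lines of R. This is a minimal sketch: the `quality` vector below is hypothetical and only demonstrates the filter, while the final calculation uses the two figures reported in the text (280 and 1599).

```r
# Hypothetical ratings, used only to show the counting logic.
quality <- c(3, 4, 5, 5, 5, 6, 6, 6, 7, 8)

# A sample is "non-average" when its rating is below 5 or above 6.
non_average <- quality < 5 | quality > 6
n_non_average <- sum(non_average)
print(n_non_average)

# Share of non-average samples, using the counts reported in the text.
share_pct <- round(100 * 280 / 1599, 1)
print(share_pct)
```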

I will now look at the distribution in each variable by plotting histograms.

This first histogram gives a quick overview of the distribution of the variables in this dataset. I would like some more detail, so I will plot these histograms again at a larger scale, adjusting the x-axis to show more detail.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500

Looking at the four graphs above, the ones for citric acid and residual sugar stand out. The citric acid distribution is somewhat irregular and the residual sugar distribution has a long tail, due to outliers. The first two graphs, for fixed acidity and volatile acidity, both resemble a normal distribution but are slightly skewed to the right. Numerical summaries for these properties are also provided and reflect these observations. For example, the maximum value for residual sugar is far higher than the third quartile value, indicating that it is an outlier.
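To make the outlier remark concrete, the residual sugar summary values can be compared against the common 1.5 × IQR rule of thumb. This rule is an assumption introduced here for illustration; the report itself does not state which outlier criterion, if any, was used.

```r
# Residual sugar values taken from the numerical summary above.
q1 <- 1.9
q3 <- 2.6
max_value <- 15.5

# Upper fence under the 1.5 * IQR convention (an assumed criterion).
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
is_outlier <- max_value > upper_fence

print(upper_fence)
print(is_outlier)
```

Under this convention the maximum lies well beyond the upper fence, which agrees with the long tail seen in the histogram.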

##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density      
##  Min.   :0.9901  
##  1st Qu.:0.9956  
##  Median :0.9968  
##  Mean   :0.9967  
##  3rd Qu.:0.9978  
##  Max.   :1.0037

The chlorides distribution has some outliers, which is also seen in its numerical summary. The total sulfur dioxide and free sulfur dioxide graphs are both right skewed, and the density graph resembles a normal distribution.

##        pH          sulphates         alcohol     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90

From the graphs above, the distribution of pH in this dataset resembles a normal distribution, the sulphates distribution seems to have some outliers, and the alcohol graph is skewed to the right.

I will now replot the distributions for residual sugar and chlorides, but using only the data in the 90% confidence interval, thereby removing extreme outliers.

After removing outliers, I get a right-skewed distribution for residual sugar and a distribution starting to resemble a normal distribution for chlorides.
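The trimming step can be sketched as follows. The report does not show how the 90% interval was computed, so this sketch assumes it means keeping values between the 5th and 95th percentiles; the helper name `keep_central` is hypothetical.

```r
# Keep only the values between two quantiles (assumed: 5th and 95th).
keep_central <- function(x, lower = 0.05, upper = 0.95) {
  bounds <- unname(quantile(x, probs = c(lower, upper), na.rm = TRUE))
  x[!is.na(x) & x >= bounds[1] & x <= bounds[2]]
}

# Toy input: the integers 1 to 100.
values <- 1:100
trimmed <- keep_central(values)
print(range(trimmed))
print(length(trimmed))
```

Applied to a column such as residual sugar, the trimmed vector could then be passed to `hist()` to redraw the distribution without the extreme tail.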

Univariate Analysis

Structure of Dataset

As shown earlier, the red wine dataset has 1599 observations and 12 variables. The first 11 variables are chemical properties associated with the wine and the last variable indicates the quality rating given to the wine. Furthermore, the property variables are numerical, and the quality variable is an integer rating that I treat as categorical.

Main feature of interest in dataset

For this dataset, I am interested in determining how the given chemical properties are related to the quality of the wine and which property has the strongest influence on quality.

Hypothesis based on observations from graph

With regard to the near-normally distributed data, I am interested in determining whether either side (or “tail”) of the distribution corresponds to good or bad quality wine. For example, when considering the pH distribution, could it be that a lower-range pH leads to a good quality wine, a higher-range pH leads to a poor quality wine, and a pH in the middle leads to an average wine?

For the graphs skewed to the right, for example residual sugar, I am speculating that the tail (the higher residual sugar data) corresponds to higher quality wine, whereas the residual sugar content for average and poor quality wine is similar.

These are just some of the initial questions and intuitions that come to mind when looking at univariate visualizations.

Unusual distributions in data

The chlorides and residual sugar distributions both had outliers, making it difficult to see the shape of the distribution initially. I removed these data points to get a better look at these distributions but did not permanently remove them from the dataset since they may be of interest later in the analysis.

Bivariate Plots Section

I will now start looking at relationships between variables. In particular, I want to investigate relationships between chemical property variables and the quality variable. Relationships between the chemical properties will also be interesting to see. For example, an interesting relationship to consider is that between citric acid and pH: since citric acid is an acid, one would expect to see a negative correlation, with pH decreasing as citric acid increases.

Quality vs pH of Sample

For the pH-quality scatter plot the majority of the data clusters between 3.1 and 3.6. There does not seem to be a noticeable increase or decrease in quality rating when pH is increased or decreased. Low and high quality ratings are found at both low and high pH levels. In general good, bad, and average quality wines have similar pH levels. This suggests that pH did not play an important role in the quality of the wine. Referring back to my hypothesis based on observations of the pH histogram, it seems that my speculation was wrong since there is no noticeable change in wine quality on either side of the bulk of the data.

Quality vs Residual Sugar in Sample

The data used to create the quality - residual sugar plot was taken from the original dataset and hence still contains the outliers for residual sugar. Interestingly, the quality of these high residual sugar samples ranges from low to average. I would have expected higher residual sugar to have a positive effect on the wine quality rating. To get a better look at the bulk of the data I will remove these data points, only keeping data in the 90% confidence interval.

The majority of wine samples have residual sugar levels between 1.5 and 3. Within this range there are some samples that have very similar residual sugar content but very different quality ratings. For a residual sugar content close to 2 there are wine samples plotted with very low quality ratings, around 2.5, as well as wine samples with very high quality ratings, around 8.25. However, most of the quality ratings in this range are between 5 and 6, in the average range. This plot therefore does not provide any indication that increasing or decreasing the sugar content of the wine improves its quality rating.

pH vs acidity

Next I will investigate how changing fixed acidity affects the pH of the wine.

## 
##  Pearson's product-moment correlation
## 
## data:  red$pH and red$fixed.acidity
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782

This is an interesting illustration. From the correlation coefficient and plot there appears to be a strong negative correlation between pH and fixed acidity. This makes sense since one would expect that as fixed acidity increases, pH would decrease.
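The test output above can be reproduced in outline with base R. The sketch first checks that the reported t statistic follows from the reported correlation and degrees of freedom, using t = r·sqrt(df) / sqrt(1 − r²), and then runs the same kind of `cor.test()` call on the six sample rows printed at the top of the report. Six rows are far too few to be meaningful; they only illustrate the call.

```r
# Values reported in the pH vs fixed acidity test above.
r <- -0.6829782
df <- 1597

# t statistic implied by r and df; should be close to the reported -37.366.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
print(round(t_stat, 3))

# Same call on the six sample rows (illustration only).
pH <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4)
result <- cor.test(pH, fixed_acidity, method = "pearson")
print(result$estimate)   # sample correlation coefficient
print(result$parameter)  # degrees of freedom, n - 2
```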

Let's see if there is a similar trend with volatile acidity:

## 
##  Pearson's product-moment correlation
## 
## data:  red$pH and red$volatile.acidity
## t = 9.659, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1880823 0.2807254
## sample estimates:
##       cor 
## 0.2349373

There seems to be a slight upward trend in this plot, as can also be seen from the trend line. The trend shows that pH increases as volatile acidity increases, a positive correlation. This is unexpected since one would expect pH to decrease when acidity increases, as was seen in the previous graph.

Let's look at the relationship between volatile acidity and fixed acidity to see if this will help in explaining the above trend. I will use data in the 90% confidence interval for volatile acidity.

## 
##  Pearson's product-moment correlation
## 
## data:  red.v_acidity$volatile.acidity and red.v_acidity$fixed.acidity
## t = -9.6263, df = 1426, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2951096 -0.1976758
## sample estimates:
##        cor 
## -0.2470169

The plot created for volatile acidity vs fixed acidity as well as the calculated correlation coefficient indicate that there is a small negative correlation between these variables.

Citric acid is one of the predominant fixed acids found in wine, along with tartaric, malic, and succinic acid. I would therefore expect the citric acid-pH scatter plot to resemble the fixed acidity-pH plot.

Clearly, from the graph, one can see that there is a predominantly negative correlation between pH and citric acid. Increasing the citric acid content of wine leads to a decrease in pH, as expected.

Relationships between quality and remaining chemical properties

I will now create scatter plots for the remaining properties.

First, I will plot the relationship between quality and fixed acidity.

From this plot one cannot see an obvious increase or decrease of quality with fixed acidity. Most of the data for good, bad, and average quality wine lies between 6 and 10 on fixed acidity. Calculating the correlation coefficient using data in the 90% confidence interval for fixed acidity:

## 
##  Pearson's product-moment correlation
## 
## data:  red.fixed$quality and red.fixed$fixed.acidity
## t = 3.9067, df = 700, p-value = 0.0001026
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07286638 0.21771966
## sample estimates:
##       cor 
## 0.1460759

There exists a very small positive correlation between fixed acidity and quality.

Next I will look at the relationship between quality and volatile acidity, in the 90% interval.

The plot shows a gradual decrease in quality as volatile acidity increases. Finding the correlation coefficient also shows that there is a moderate negative correlation between quality and volatile acidity:

## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and red$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

The data for citric acid seems quite dispersed on this plot. There is quite a lot of data with citric acid content close or equal to zero. I will plot the median and mean citric acid content for each quality rating next. I will also find the correlation coefficient.

## 
##  Pearson's product-moment correlation
## 
## data:  red$citric.acid and red$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

Based on both mean and median citric acid content at each quality rating, the plot shows a positive correlation. Quality seems to increase with citric acid content, but based on the correlation coefficient (0.226), this correlation is small.
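The per-rating mean and median used for this plot can be computed by splitting the citric acid values by quality rating. The sketch below uses the six sample rows printed at the top of the report, so it only covers ratings 5 and 6; the variable names are illustrative.

```r
# Citric acid and quality from the six sample rows (illustration only).
citric_acid <- c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00)
quality <- c(5, 5, 5, 6, 5, 5)

# Group citric acid by quality rating, then summarise each group.
by_quality <- split(citric_acid, quality)
mean_by_quality <- sapply(by_quality, mean)
median_by_quality <- sapply(by_quality, median)

print(mean_by_quality)
print(median_by_quality)
```

On the full dataset the same two vectors would hold one value per quality rating from 3 to 8, ready to be plotted against the rating.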

For the above graph the majority of the data lies between 5 and 6 on the quality rating axis. There does not seem to be an increase or decrease of quality with the chloride content of the wine. This could be because the chloride content is so small that the change in chloride content is not big enough to cause a noticeable difference to the wine.

Let’s look at quality’s relationship to free and total sulfur content:

The above two graphs look similar. This is not unexpected since free sulfur dioxide is included in total sulfur dioxide. As with many of the other graphs, most of the data lies at quality ratings of 5, 6, and 7, and increasing the sulfur dioxide does not have a large impact on quality. Looking at the data clustered around quality ratings of 5, 6, and 7, however, there are more high sulfur dioxide samples at the lower quality rating of 5 than at 6 or 7.

Looking at density vs quality, the density of the wines differs very little. Therefore density does not appear to have a strong impact on the quality of wine.

Most of the wines had a pH between 3 and 3.5. Within this range one finds wines of good, bad, and average quality.

There seems to be an increase in quality with sulphate content. The correlation coefficient calculated below indicates that this correlation is small.

## 
##  Pearson's product-moment correlation
## 
## data:  red$sulphates and red$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

I will use data in the 90% confidence interval to plot the relationship between quality and alcohol.

## 
##  Pearson's product-moment correlation
## 
## data:  red$alcohol and red$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

Based on the plot and correlation coefficient, alcohol content has a moderate, positive correlation with quality.

Lastly I will visualise the correlation between each variable:

In terms of quality, this correlation visualization once again highlights the negative correlation between quality and volatile acidity as well as the positive correlation between quality and alcohol content.
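The numbers behind such a correlation visualisation come from a pairwise correlation matrix. As a minimal sketch, the matrix below is computed with base R's `cor()` on four columns of the six sample rows printed at the top of the report; the report's own plotting code is not shown, so the choice of columns and the name `red_sample` are illustrative assumptions.

```r
# Four columns from the six sample rows (illustration only).
red_sample <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  pH = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5, 5, 5, 6, 5, 5)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(red_sample, method = "pearson")
print(round(cor_matrix, 2))
```

Each cell holds the correlation between the row variable and the column variable, so the matrix is symmetric with ones on the diagonal.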

Other noticeable positive correlations:

Other noticeable negative correlations:

Bivariate Analysis

The bivariate plotting section showed weak correlations between many chemical properties and quality. The correlations between quality and fixed acidity, residual sugar, free sulfur dioxide, and pH were particularly small. The volatile acidity and alcohol properties stood out, and some strong correlations were seen between the chemical properties themselves. I will now discuss my main findings from the bivariate plotting process in more detail.

Exploring pH and acidity in wine

It was found that pH did not have a significant effect on quality. However, pH did not vary much in the samples. The majority of samples had a pH between 3.1 and 3.5. Therefore it is more accurate to say that within this pH range, pH does not have a significant effect on the perceived quality of a wine.

While on the topic of pH, another interesting observation was the relationship between volatile acidity and pH. Contrary to what was expected, pH increased with increasing volatile acidity. This trend was not very strong but noticeable nonetheless. The quantity of volatile acidity (comprised mostly of acetic acid) in the data is consistently and considerably lower than the amount of fixed acid in the samples. This could explain the unusual trend: the rise in pH may be due to lower levels of fixed acid in the samples with higher volatile acidity. Furthermore, research suggests that pH is generally a quantitative assessment of fixed acidity, not volatile acidity. Looking at the relationship between fixed acidity and pH, the trend is in line with what is expected: increasing the fixed acidity results in a decrease in pH. Citric acid, a fixed acid, consequently also shows this negative correlation.

The correlation between quality and volatile acidity is moderate, with a correlation coefficient of -0.39. Literature claims that volatile acidity is closely associated with quality. Winemakers monitor volatile acidity and use it as an indication of spoilage.

Other interesting observations

The amount of residual sugar in a sample did not have a big effect on the quality rating it received. The residual sugar content in these samples was low, mostly between 1.5 and 3 g/L. This is within the range 0-9 g/L, indicating that these are dry red wine samples.

A moderate positive correlation was found between alcohol content and the quality rating of wine. This and volatile acidity content were the strongest quality correlations found in this investigation. The strongest correlations overall between independent variables were those of pH and fixed acidity (negative correlation), density and fixed acidity (positive correlation), and citric acid and volatile acidity (negative correlation). This can be seen in the correlation illustration created in the previous section.

Multivariate Plots Section

Based on my bivariate plot section I would like to investigate in more detail the relationship between alcohol content and quality as well as volatile acidity and quality. I will start by splitting the quality variables into the following categories:

The above categories were chosen based on the first graph in this report showing the distribution of quality and the numerical summary supporting it.
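Splitting a numeric rating into categories can be done with `cut()`. The category boundaries are not listed in this report, so the sketch below assumes three groups based on the earlier discussion of average ratings: low (below 5), average (5 or 6), and high (above 6). Both the break points and the labels are assumptions.

```r
# One example rating for each value observed in the data (3 to 8).
quality <- c(3, 4, 5, 6, 7, 8)

# Assumed boundaries: (0, 4] low, (4, 6] average, (6, 10] high.
quality_category <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("low", "average", "high"),
  ordered_result = TRUE
)

print(table(quality_category))
```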

Alcohol vs Volatile Acidity by Quality Category

Since volatile acidity and alcohol are the properties that have the strongest correlation with quality, I would like to see the relationship between them for each quality category. For this plot I will use data in the 90% confidence interval for alcohol and volatile acidity. I will be using the mean volatile acidity for each alcohol content data point.

## 
##  Pearson's product-moment correlation
## 
## data:  red.temp$volatile.acidity and red.temp$alcohol
## t = -9.1306, df = 1166, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3110678 -0.2039796
## sample estimates:
##        cor 
## -0.2583171

The plot shows that in general, for a given alcohol level the high quality wine had the lowest volatile acidity and the low quality wine had the highest volatile acidity. Furthermore a small overall downward trend is noticed, where volatile acidity decreases as alcohol increases.
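The averaging step behind this plot (mean volatile acidity at each alcohol level within each quality group) can be sketched with `aggregate()`. The sketch uses the six sample rows printed at the top of the report and groups by the raw quality rating as a stand-in for the quality category, since the category definitions are not shown; `red_sample` is an illustrative name.

```r
# Three columns from the six sample rows (illustration only).
red_sample <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  quality = c(5, 5, 5, 6, 5, 5)
)

# Mean volatile acidity for every (alcohol, quality) combination present.
mean_va <- aggregate(
  volatile.acidity ~ alcohol + quality,
  data = red_sample,
  FUN = mean
)
print(mean_va)
```

Each row of the result is one point on the plot: an alcohol level, a quality group, and the mean volatile acidity for that combination.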

Let's calculate the correlation coefficients again to see which property has the strongest impact on quality:

## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and red$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
## 
##  Pearson's product-moment correlation
## 
## data:  red$quality and red$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

These coefficients indicate that alcohol content had the greatest impact on wine quality.

Acidity in wine

Next, I want to plot the relationship between volatile acidity and fixed acidity for each quality category. I am once again using data in the 90% confidence interval. I will be plotting the mean volatile acidity for each fixed acidity data point.

## 
##  Pearson's product-moment correlation
## 
## data:  red.temp1$volatile.acidity and red.temp1$fixed.acidity
## t = -10.04, df = 1272, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3211055 -0.2193057
## sample estimates:
##        cor 
## -0.2709631

The correlation coefficient and graph show the small negative correlation between these two properties. For the majority of the plot, the highest volatile acidity corresponds to low quality wine and the lowest volatile acidity corresponds to high quality wine.

Investigating the distribution of volatile acidity for each quality category:

The histogram shows that the high quality wine samples tend to have lower volatile acidity. The boxplots confirm this, with the mean (indicated by the cross) and median volatile acidity being lowest for high quality wine data.

The correlation illustration showed a strong negative correlation between citric acid and pH. Let's see how quality is influenced by this relationship:

For the most part, the high quality wines have the highest citric acid content at a given pH.

The correlation illustration showed a strong positive correlation between density and fixed acidity; the plot below investigates this. Mean density is plotted against fixed acidity for the 90% confidence interval.

The plot shows that an increase in fixed acidity is accompanied by an increase in density. At a given fixed acidity, good quality wines generally have lower densities than poor quality wines.

The correlation illustration also showed a strong correlation, this time negative, between pH and fixed acidity. The graph is shown below:

The plot illustrates the negative correlation between these two variables, but there is no obvious trend in quality as there was in the previous plot.

Exploring the Alcohol Property

There is a negative correlation between density and alcohol, as can be seen in the plot below:

Density does not have an obvious impact on quality at a given level of alcohol content.

The histogram and boxplots below show the alcohol distribution for the red wine samples, providing additional detail on quality as well.

Clearly higher alcohol content is associated with better quality wines.

Multivariate Analysis

The relationship between volatile acidity, alcohol, and quality indicated that higher quality wines tend to have lower volatile acidity at any given alcohol level. This supports the fact that volatile acidity indicates spoilage in wines, and therefore indicates a poorer quality wine.

A small negative correlation was observed between fixed acidity and volatile acidity. This may help explain why pH was shown to increase with increasing volatile acidity in an earlier graph- it could have been due to the lower levels of fixed acidity at high volatile acidity levels. Once again, this graph together with subsequent boxplots and a histogram showed that high quality wines generally have lower volatile acidity levels.

The plot of citric acid vs pH showed that these two properties have a strong negative correlation. Furthermore, this graph showed higher quality wines to have greater citric acid content at a given pH level. Citric acid is a fixed acid, and according to the literature fixed acids impart the tartness that is a fundamental feature of wine. Therefore it makes sense that citric acid has a positive relationship with wine quality.

Plotting density vs fixed acidity, one sees that better quality wines have lower densities at a given fixed acidity. Interestingly, density does not have this same impact on quality when plotted against alcohol. This could be because the relationship between quality and alcohol is much stronger than the relationship between density and quality, unlike in the density vs fixed acidity graph, where both these properties had similar sized relationships with quality.

Finally, the boxplots and histograms showing the distribution of alcohol for each quality category support the hypothesis that higher quality wines have higher alcohol content.


Final Plots and Summary

I will now select three plots illustrating the main and most interesting findings of this study.

Plot One

Description One

Plot one is an important plot since it shows the relationship between quality and alcohol. Alcohol was found to be the chemical property with the strongest influence on wine quality. A positive correlation coefficient of 0.476 was calculated for this relationship. This is considered to be a moderate uphill relationship. From this graph one can see that the data points tend to a higher quality rating as alcohol content is increased.

Plot Two

Description Two

Having seen that alcohol and volatile acidity have the biggest impacts on quality, I used the plot above to illustrate their relationship with quality and each other in more detail. This plot helped illustrate that at a given alcohol content, low volatile acidity corresponds to better quality wine.

Plot Three

Description Three

Citric acid has a strong negative correlation with pH, as expected, and a small positive correlation with quality. This third plot shows that overall, at a given pH, a wine with a higher citric acid content received a better quality rating. Citric acid is said to add a liveliness to wine and help to bring out its fruity flavors.

Reflection

Initially I thought that pH and residual sugar would have the greatest impact on wine quality. It was interesting to find that these did not play such a big role, and that alcohol and volatile acidity had the biggest impact on quality.

The majority of this data set was made up of samples that were classified as average by wine experts. This may be a true representation of the quality distribution in dry red wines; however, it makes it difficult to analyse the wines at either extreme of the spectrum, especially since the actual number of samples with non-average quality was so low - there were only 280 samples out of 1599 where the rating is either lower than 5 or higher than 6. It would be interesting to perform this same analysis on a dataset containing a greater number of red wines with poor or good quality ratings. This may lead to new insights into how and to what extent the chemical properties influence wine quality.

Resources

http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity

http://www.calwineries.com/learn/wine-chemistry/acidity

http://winefolly.com/tutorial/wines-from-dry-to-sweet-chart/

http://eckraus.com/citric-acid/